CorporAl: a Method and Tool for Handling Overlapping Parallel Corpora

Authors

  • Mark Fishel
  • Heiki-Jaan Kaalep
Abstract

This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, to different segmentation levels of the corpora, and to material omitted from either corpus. The aim is to detect matching sentence pairs and then either produce combinations of the overlapping corpora or compare them and assess their quality relative to each other. The tool lets the user define the desired behavior when combining corpus pairs, resulting in pure comparison, maximum-size, or maximum-quality versions of the combinations. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of using the method on two translation systems: a phrase-based and a parsing-based one.
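As a rough illustration of what detecting matching sentence pairs across two overlapping corpora could look like, here is a minimal Python sketch. It is not the CorporAl algorithm: the function names (`normalize`, `match_sentences`, `combine_max_size`), the similarity threshold, and the use of `difflib` are all assumptions made for this example. It matches on the source side only, tolerates minor text changes and omitted material, and does not handle differing segmentation levels.

```python
import difflib
import re


def normalize(sentence):
    # Lowercase and drop punctuation (anything that is not a word
    # character or whitespace), then collapse runs of whitespace.
    stripped = re.sub(r"[^\w\s]", "", sentence.lower())
    return " ".join(stripped.split())


def match_sentences(corpus_a, corpus_b, threshold=0.9):
    """Return (i, j) index pairs whose source sides are near-identical.

    Each corpus is a list of (source, target) tuples. The threshold is an
    arbitrary illustrative value, not one taken from the paper.
    """
    norm_b = [normalize(src) for src, _ in corpus_b]
    used = set()
    pairs = []
    for i, (src, _) in enumerate(corpus_a):
        norm_a = normalize(src)
        best_j, best_score = None, threshold
        for j, candidate in enumerate(norm_b):
            if j in used:
                continue
            score = difflib.SequenceMatcher(None, norm_a, candidate).ratio()
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def combine_max_size(corpus_a, corpus_b, pairs):
    # Hypothetical "maximum-size" combination: keep all of corpus A and
    # append the entries of corpus B that found no match in A.
    matched_b = {j for _, j in pairs}
    extra = [entry for j, entry in enumerate(corpus_b) if j not in matched_b]
    return list(corpus_a) + extra


corpus_a = [
    ("The cat sat on the mat.", "Kass istus matil."),
    ("Hello world!", "Tere maailm!"),
]
corpus_b = [
    ("hello, world", "Tere, maailm"),
    ("The cat sat on the mat", "Kass istus mati peal."),
    ("Extra sentence here.", "Lisa lause siin."),
]
pairs = match_sentences(corpus_a, corpus_b)
combined = combine_max_size(corpus_a, corpus_b, pairs)
```

The quadratic search is only practical for small inputs. A maximum-quality variant would additionally need some way to choose between the two target sides of a matched pair, which this sketch leaves out.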


Similar resources

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


Experiments on Processing Overlapping Parallel Corpora

The number and sizes of parallel corpora keep growing, which makes it necessary to have automatic methods of processing them: combining, checking and improving corpora quality, etc. Here we introduce a method that enables many of these tasks by exploiting overlapping parallel corpora. The method finds the correspondence between sentence pairs in two corpora: first the corresponding langua...


Polyphraz: a tool for the quantitative and subjective evaluation of parallel corpora

The PolyphraZ tool is under construction in the framework of the TraCorpEx project (Translation of Corpora of Examples), for the management of parallel multilingual corpora (coding, format, correspondence). It is a software platform allowing the preparation and handling of parallel corpora (languages, codings...), parallel presentation, and addition of new languages to existing corpora by calli...


Better handling of a bilingual collection of texts

Statistical machine translation models are trained from parallel corpora, which are collections of translated texts. These texts are usually processed using dedicated tools called “sentence aligners”, which output parallel sentence pairs. However, parallel resources are very scarce in certain languages or domains. Alternative solutions have been proposed that extract parallel sentences from the...


Comparing Parallel Corpora and Evaluating their Quality

The availability of partially overlapping parallel corpora for a language pair opens up opportunities for automatically comparing, evaluating and improving them. We compare and evaluate the alignment quality of two English-Estonian parallel corpora that have been created independently, but contain overlapping texts. We describe how to determine the overlapping parts and find their alignment sim...



Journal:
  • Prague Bull. Math. Linguistics

Volume 94

Publication date: 2010